Abstract: Analyzing and recognizing images observed under
different conditions are among the most important tasks of the
visual cortex. Although many studies have been carried out in
computer vision and neuroscience, the underlying processes in
the visual cortex are not completely understood. A biologically
inspired model of the visual cortex that has recently gained
attention for object recognition is HMAX, which describes a
feed-forward hierarchical structure. This model shows a degree
of scale and translation invariance. The visual cortex is also
largely insensitive to rotation and to noise in color space. In
this paper, we introduce a novel method that increases
robustness against noise and rotation. We describe a
hierarchical system that closely follows the organization of the
visual cortex and builds an increasingly complex and invariant
feature representation with the R, G, B and gray channels as
inputs. As in recently published methods, learning takes place
only in the S2 layer of the HMAX structure. In the proposed
model, however, instead of directly using the distance between
the patches and the C1 units, we use the distance between the
largest singular values of the patches and of the C1 units.
These values are largely insensitive to rotation. Finally, we
ran experiments on the COREL dataset and showed that the
proposed model performs better than previous HMAX models in
complex visual scenes.

Keywords: Robust Object Recognition; Visual Cortex; HMAX
I. Introduction

The visual cortex is the part of the brain that processes visual
information. The discovery and analysis of cortical visual
areas, major accomplishments of visual neuroscience, help us
better understand how the visual cortex works, which is a
critical question in neuroscience. Because humans and primates
outperform the best machine vision systems on almost any
measure, building a system that emulates object recognition in
cortex has always been an attractive but elusive goal [1]. In
most cases, the use of visual neuroscience in computer vision
has been limited to early vision stages, such as stereo
algorithms [2], derivative-of-Gaussian filters and, more
recently, Gabor filters [3]. Although some other computer
vision approaches are fully inspired and challenged by human
vision, they have not yet gone beyond the very first stages of
processing in cells [4-8].

Object recognition, one of the most advanced tasks of the
visual cortex, is very important for animals as well as for
higher primates [9]. Although object recognition has received
a great deal of attention within the field of computer vision,
the underlying computational processes in the visual cortex
are not completely understood, and simulating them in a
computer still involves a high degree of computational
complexity. In short, object recognition is thought to be
carried out by the ventral visual path, which runs from primary
visual cortex, through extrastriate visual areas, to
inferotemporal cortex and on to prefrontal cortex (PFC), which
plays an important role in linking perception to memory [9].
This ventral visual path has a hierarchical architecture that
reflects sensitivity to increasingly complex preferred stimuli,
from simple cells in primary visual cortex to complex cells in
extrastriate visual areas [9, 10].

A neuroscientifically inspired model that reflects the current
understanding of the ventral visual path as a feed-forward
hierarchical structure, and that has recently gained attention
in object recognition tasks, is the HMAX model introduced by
Riesenhuber and Poggio [11]; this model focuses more on
designing simple operations inspired by the visual cortex and
less on learning. Extending the work of Hubel and Wiesel [12],
the HMAX model possesses desirable properties observed in
visual systems and shows a significant degree of translation
and scale invariance [10]. Serre et al. [1] extended the
original HMAX model to add multi-scale representations as well
as more complex visual features. It should be noted that the
conditions under which this model was shown to perform are
synthetic and far from natural; recently, however, HMAX has
been analyzed in natural scenes [13]. Because these models are
not strictly invariant to rotation [1], in this paper we modify
the model for the analysis of natural scenes under rotation and
noise by using largest singular values and color spaces. To
demonstrate the capability of the proposed method, the COREL
dataset, which consists of natural images, is used in this
paper.

The rest of the paper is organized as follows. Section II
presents the structure and function of the HMAX model.
Section III introduces the proposed model, which uses the RGB
and gray-level color spaces and the largest singular values in
the learning phase. Section IV presents the simulation results
of the experiments carried out to test and compare these models
under different conditions. Finally, Section V concludes the
paper.

II. HMAX's Structure and Function

The algorithm that implements this model operates over a
number of scale bands. Each scale band determines the size of
the filters employed and the number of units pooled (the size
of the local area for the MAX operation within that scale
band). This is explained below.

A. Standard Model 

As shown in Fig. 1, in its simplest form the model consists of
four layers of computational units, where simple S units
alternate with complex C units. The S units increase
selectivity by means of a tuning function [11, 12]. The C units
pool their inputs through a maximum (MAX) operation to increase
invariance to position and scale. The model reproduces several
properties of cells along the ventral stream of visual cortex
[14]. For instance, the operations of the C units resemble some
behaviors of complex cells in V1 [15] and V4 [16].

Fig. 1. The standard HMAX model structure. The S1 layer consists
of simple Gabor filters at several scales and orientations,
indicated by ellipses of various sizes and colors (each color
denotes one orientation). C1 pools the filter outputs spatially
(gray squares of size Ns) and across adjacent scales into a
scale band. S2 is tuned to conjunctions of orientations. C2
provides further spatial and scale invariance. The C2 outputs
are fed directly to a classifier.



B. Modified HMAX model 

This model is based on the observation that learning occurs at
all stages of the visual cortex. Learning is used in the S2
layer to obtain good performance in robust object recognition
[1], or is combined with other features for scene
classification [17]. The response of each S2 unit depends in a
Gaussian-like way on the Euclidean distance between a new input
and a stored prototype (see Fig. 2). That is, for an image patch
X from the previous C1 layer at a particular scale, the response
r of the corresponding S2 unit is given by

$$r = \exp\left(-\beta \, \lVert X - P_i \rVert^2\right) \qquad (1)$$

where $\beta$ defines the sharpness of the tuning and $P_i$ is
one of the centers (similar to RBF units), extracted randomly.
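
As a minimal sketch of the Gaussian-like tuning in (1), the
Python snippet below computes the response of one S2 unit to a
C1 patch. The function name `s2_response`, the nested-list
representation of the patches and the value of beta are
illustrative assumptions; they are not taken from [1].

```python
import math


def s2_response(patch, prototype, beta=1.0):
    """Gaussian-like response r = exp(-beta * ||X - P||^2), as in (1).

    `patch` (X) and `prototype` (P) are equally sized 2-D lists.
    `beta` controls the sharpness of the tuning (value assumed here).
    """
    squared_distance = sum(
        (x - p) ** 2
        for row_x, row_p in zip(patch, prototype)
        for x, p in zip(row_x, row_p)
    )
    return math.exp(-beta * squared_distance)


prototype = [[0.2, 0.8], [0.5, 0.1]]   # stored prototype P_i
same = [[0.2, 0.8], [0.5, 0.1]]        # identical input patch
shifted = [[0.3, 0.7], [0.5, 0.1]]     # slightly different input patch

print(s2_response(same, prototype))     # 1.0: zero distance
print(s2_response(shifted, prototype))  # below 1.0: response decays
```

A larger beta makes the unit more selective: the response then
drops faster as the input moves away from the prototype.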

III. The proposed model 

The proposed model aims to increase tolerance to rotation and
noise by taking a maximum over the color channels and the
intensity. The starting point is to decompose an input image
into four channels (R, G, B and intensity) and to apply a MAX
operation over them. In the S1 layer, we then apply the Gabor
filters (using (2) and (3)) and adjust the filter parameters,
i.e. the orientation $\theta$, the effective width $\sigma$ and
the wavelength $\lambda$. As fully discussed in [18], we apply
some of the filters given in Table I to stimuli commonly used
to probe V1 neurons and remove the filters that are
incompatible. We also use these filters to form a pyramid of
scales over the o orientations. The filters are grouped into
the specified scale bands.

$$G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \times \cos\left(\frac{2\pi}{\lambda} X\right) \qquad (2)$$

$$\text{s.t.} \quad X = x\cos\theta + y\sin\theta, \qquad Y = -x\sin\theta + y\cos\theta \qquad (3)$$

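
The S1 Gabor filter can be sketched in Python as follows,
assuming the form commonly used in HMAX implementations: a
Gaussian envelope exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2))
multiplied by cos(2 pi X / lambda), where X and Y are the
coordinates rotated by the orientation theta. The function name
`gabor_kernel` and the numeric values for the size, sigma and
wavelength are hypothetical stand-ins for the Table I
parameters, and the normalization steps that full
implementations often apply are left out.

```python
import math


def gabor_kernel(size, theta, sigma, wavelength, gamma=0.3):
    """Sample a Gabor function on a size x size grid (size must be odd).

    theta is the orientation in radians, sigma the effective width,
    wavelength the cosine wavelength and gamma the aspect ratio.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinates rotated by the filter orientation.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(
                -(x_rot ** 2 + gamma ** 2 * y_rot ** 2) / (2.0 * sigma ** 2)
            )
            carrier = math.cos(2.0 * math.pi * x_rot / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# Four orientations; the numeric parameters are illustrative only.
orientations = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
bank = [gabor_kernel(7, t, sigma=2.8, wavelength=3.5) for t in orientations]
print(len(bank), len(bank[0]), len(bank[0][0]))  # 4 7 7
```

Each kernel would be convolved with every input channel to
produce one S1 map per orientation and filter size.
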
Fig. 2. The N patches in the S2 layer, in C1 format. Each
orientation in the patch is matched to the corresponding
orientation in C1. The result is one image per C1 band and
patch. The C2 values are computed by taking a MAX over all S2
units associated with a given patch. Thus, the C2 response has
length N.



The C1 units pool, through a maximum operation, a local area
(Ns×Ns) of the S1 units with the same orientation and scale
band. This pooling increases the tolerance to shift. That is,
the response r of a complex unit, given by (4), corresponds to
the response of the strongest of its afferent cells
$(x_1, x_2, \dots, x_m)$ from the previous S1 layer.

$$r = \max_{j=1,\dots,m} x_j \qquad (4)$$

For instance, a scale band contains several maps produced by
filters of different sizes. The maps have the same
dimensionality but are the outputs of different filters.
Pooling then takes the maximum over the same spatial
neighborhood of these filter outputs.
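
The C1 pooling step can be sketched in Python as below. This is
an illustration under stated assumptions: the function name
`c1_pool` is hypothetical, the S1 maps are plain nested lists,
and the Ns×Ns windows are taken without overlap for simplicity,
whereas an actual implementation may use overlapping windows.

```python
def c1_pool(s1_maps, ns):
    """MAX-pool S1 maps of one orientation within one scale band.

    `s1_maps` is a list of equally sized 2-D lists, one per filter
    size in the band. The maximum is taken first across the maps
    and then over non-overlapping ns x ns spatial windows.
    """
    height, width = len(s1_maps[0]), len(s1_maps[0][0])
    # Maximum across the filter sizes of the band, per position.
    merged = [
        [max(m[r][c] for m in s1_maps) for c in range(width)]
        for r in range(height)
    ]
    pooled = []
    for r0 in range(0, height, ns):
        row = []
        for c0 in range(0, width, ns):
            window = [
                merged[r][c]
                for r in range(r0, min(r0 + ns, height))
                for c in range(c0, min(c0 + ns, width))
            ]
            row.append(max(window))  # strongest unit, as in the MAX rule
        pooled.append(row)
    return pooled


map_a = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 5, 1], [0, 0, 1, 1]]
map_b = [[0, 0, 9, 0], [0, 0, 0, 0], [7, 0, 0, 0], [0, 0, 0, 2]]
print(c1_pool([map_a, map_b], ns=2))  # [[4, 9], [7, 5]]
```

Because only the strongest response in each window survives, a
small shift of the stimulus inside the window leaves the pooled
value unchanged.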

We also propose that, for each S2 unit, the largest singular
values of both the patches and the C1 units are computed. The
prototypes P, or patches, of size n×n with o orientations are
randomly extracted from the C1 layers of the training images
and then stored. There are two reasons for using the largest
singular values, as illustrated by the following two examples:

Example 1: Tolerance against rotations

We assume that the patch is

$$P_1 = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix},$$

therefore $A_1 = P_1 \times P_1^T$, that is,

$$A_1 = \begin{bmatrix} 14 & 32 & 50 \\ 32 & 77 & 122 \\ 50 & 122 & 194 \end{bmatrix}.$$

Now if we rotate $P_1$ by 90°, $P_2$ is

$$P_2 = \begin{bmatrix} 3 & 6 & 9 \\ 2 & 5 & 8 \\ 1 & 4 & 7 \end{bmatrix}$$

and $A_2 = P_2 \times P_2^T$ is equal to

$$A_2 = \begin{bmatrix} 126 & 108 & 90 \\ 108 & 93 & 78 \\ 90 & 78 & 66 \end{bmatrix}.$$

Then, for the largest eigenvalue $\lambda$ of $A_1$ and $A_2$,
we have, for $A_1$:

$$\begin{aligned}
(14 - \lambda)x_1 + 32x_2 + 50x_3 &= 0 \\
32x_1 + (77 - \lambda)x_2 + 122x_3 &= 0 \\
50x_1 + 122x_2 + (194 - \lambda)x_3 &= 0
\end{aligned} \qquad (5)$$

Solving the corresponding systems for $A_1$ and $A_2$ gives the
same largest eigenvalue, $\lambda \approx 283.86$, so the
largest singular value $\sqrt{\lambda} \approx 16.85$ is
unchanged by the rotation.

Example 2: No use of the patch matrices

$S_{P_1} = 4$, $S_{P_2} = 2$, so that $S_{P_1} \neq S_{P_2}$
(S is the singular value).

For each C1 image (output of a band), compute:

$$Y_{\theta_i} = \exp\left(-\beta \, \lVert S(X_{\theta_i}) - S(P_{j\theta_i}) \rVert^2\right) \qquad (9)$$

where $P_j$ is a randomly extracted patch. Then, in the C2
layer, for each unit we compute:

$$F_j = \sum_i Y_{\theta_i} \qquad (10)$$

The size of F is equal to the number of patches. So, we have a
complex feature vector F that is tolerant to rotation, noise,
size and shift.

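
The rotation argument of Example 1 can be checked numerically.
The sketch below builds the 3×3 patch with entries 1 to 9,
rotates it by 90°, and compares the largest singular values of
the two matrices. It is only an illustration: the helper names
are hypothetical, power iteration is used here merely to keep
the example dependency-free (the paper does not say how the
singular values are computed, and a library SVD routine would
normally be used), and the value of beta is assumed.

```python
import math


def gram(p):
    """Return A = P * P^T for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in p] for ri in p]


def largest_singular_value(p, iterations=100):
    """Largest singular value of P via power iteration on P * P^T."""
    a = gram(p)
    n = len(a)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # iterate vanished (e.g. zero matrix): report 0
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v approximates the largest eigenvalue.
    av = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
    eigenvalue = sum(x * y for x, y in zip(v, av))
    return math.sqrt(max(eigenvalue, 0.0))


def rotate90(p):
    """Rotate a matrix by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*p)][::-1]


def sv_response(c1_patch, prototype, beta=1.0):
    """Gaussian tuning on largest singular values (beta assumed)."""
    diff = largest_singular_value(c1_patch) - largest_singular_value(prototype)
    return math.exp(-beta * diff ** 2)


p1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
p2 = rotate90(p1)                  # [[3, 6, 9], [2, 5, 8], [1, 4, 7]]
print(gram(p1))                    # the matrix A1 of Example 1
print(round(largest_singular_value(p1), 4))
print(round(largest_singular_value(p2), 4))
print(sv_response(p2, p1))         # close to 1: the rotation is ignored
```

A direct Euclidean comparison of the two patches would give a
large distance, whereas the comparison of largest singular
values treats the rotated patch as a match.
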
IV. Overall Accuracy Results for the Rotation and Noise Cases

We considered four orientations (0°, 45°, 90° and 135°) and
arranged the S1 filters to form a pyramid of scales. According
to Table I, we used 4 scale bands, each consisting of two
adjacent filter sizes (4 scale bands for a total of 8 filter
sizes). In addition, Table I gives, for each scale band, the
size Ns×Ns of the S1 neighborhood. For all the Gabor filters we
used γ = 0.3; the other filter parameters are defined in
Table I for each filter size. We used 20 patches of size 10,
randomly extracted from the training images.



We extracted 50 complex feature vectors of positive samples
and 50 complex feature vectors of negative samples, selected
randomly, as the training set. An SVM was trained on this set.
To evaluate the proposed method in the noisy case, the images
of this set were corrupted by salt-and-pepper noise with a
value of 0.1. The test set then consists of the complex feature
vectors obtained from the noisy images. As reported in
Table II, over five iterations (the training images are
reselected in each iteration) the test sets are given to the
trained SVMs and the performance is measured based on the
average mean square error (MSE).
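
The noise corruption used for the test set can be sketched as
follows. The snippet assumes that the value 0.1 denotes the
noise density, i.e. the fraction of pixels that are replaced,
and that images are 8-bit gray maps stored as nested lists; the
function name `salt_and_pepper` and the seed are hypothetical.

```python
import random


def salt_and_pepper(image, density=0.1, seed=0):
    """Return a copy of a gray image with salt-and-pepper noise.

    Each pixel is replaced with probability `density`, by 0 (pepper)
    or 255 (salt) with equal chance. `seed` makes the run repeatable.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]
    for r, row in enumerate(noisy):
        for c in range(len(row)):
            if rng.random() < density:
                noisy[r][c] = 0 if rng.random() < 0.5 else 255
    return noisy


image = [[128] * 100 for _ in range(100)]   # flat gray test image
noisy = salt_and_pepper(image, density=0.1)
changed = sum(v != 128 for row in noisy for v in row)
print(changed / 10000.0)                    # roughly 0.1
```

For a color input, the same corruption would be applied to each
of the R, G and B channels before the feature vector is built.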

As in the noisy case, we extracted 50 complex feature vectors
of positive samples and 50 complex feature vectors of negative
samples, selected randomly, as the training set. An SVM was
trained on this set. To evaluate the proposed method in the
rotation case, the images of this set were rotated by a random
angle between 1° and 180°. The test sets were then given to the
trained SVMs, and the performance was measured based on the
average mean square error (Table III).

TABLE I. Parameters used in our implementation for the S1 and
C1 layers.

For C1 layer     For S1 layer
Scale band       size

TABLE II. Comparison of our proposed model with other HMAX
models by average mean square errors (MSEs) over 5 rounds of
testing in the noise case on the COREL data set.

Image category   Our proposed   The modified   The standard
                 model          HMAX [1]       HMAX [11]
Africa people    76%            40%            52%
Beaches          78%            59%            45%
Buildings        82%            65%            58%
Buses            82%            65%            46%
Dinosaurs        78%            79%            43%
Elephants        77%            62%            49%
Flowers          69%            50%            53%
Horses           70%            66%            36%
Mountains        82%            60%            51%
Foods            73%            49%            54%
Average          76%            59%            48%



Because the previous models use the MAX operation to choose
the component values of the feature vector F, there is a high
probability that they choose noisy values. That is, if one
noisy pixel in the local pooling area has the maximum value,
this value may be transmitted by the MAX operation from the S1
layer to the C2 layer and finally be selected as a component
of F. The proposed model, however, uses in the last layer the
relationship between all the values (the singular value). The
low percentage of the standard HMAX model [11] indicates the
inability of this model to analyze and describe complex visual
scenes.



TABLE III. Comparison of our proposed model with other HMAX
models by average mean square errors (MSEs) over 5 rounds of
testing in the rotation case on the COREL data set.

Image category   Our proposed   The modified   The standard
                 model          HMAX           HMAX [11]
Africa people    75%            70%            61%
Beaches          78%            57%            53%
Buildings        [illegible]    [illegible]    [illegible]
Buses            91%            62%            55%
Dinosaurs        80%            52%            63%
Elephants        69%            47%            54%
Flowers          [illegible]    [illegible]    [illegible]
Horses           73%            61%            64%
Mountains        71%            58%            62%
Foods            74%            73%            71%
Average          75%            60%            61%




It can be seen from Tables II and III that, over all images in
the 10 categories, our model achieves the best overall
performance in describing images. Like the visual system of the
brain, this model, inspired by the visual cortex, is tolerant
to noise and rotation. Because color is used in the visual
cortex in area V1, we used the color channels. The obtained
features therefore have some advantages, but also a weakness:
a small number of patches is used to reduce the time
complexity, so some image information is lost. To better
illustrate the performance of our method in recognizing images
corrupted by noise and rotation, we used the images of the
Africa and Beaches categories and plotted the ROC curves of two
different models over the 5 iterations for the noisy case (see
Fig. 4) and the rotation case (see Fig. 5), respectively.




Fig. 4. Comparison of the ROC curves of two different models in
the noisy case for the (a) Africa and (b) Beaches categories
over 5 rounds.

Fig. 5. Comparison of the ROC curves of two different models in
the rotation case for the (a) Africa and (b) Beaches categories
over 5 rounds.

A. Speed 

The time complexity of the singular value decomposition (SVD)
is $O(m^2 n + n^2) + k$, where m and n are the dimensions of
the patch and k is the cost of the SVD of the S2 output.
Because we use square patches, the time complexity becomes
$O(n^3 + n^2)$. By contrast, the Euclidean distance used in [1]
has a complexity of $O(k_j n_j^2)$, where $k_j$ is the size of
the S2 output for a patch of size $n_j$ and four patch sizes
are used, j = 1, 2, 3, 4.

If we assume that k and $k_j$ are equal to 1, then for the
proposed model, with n = 5 and q = 20 patches,

$$T_1 = q(n^3 + n^2) = 20(5^3 + 5^2) = 20 \times 150 = 3000$$

and for the modified HMAX [1], with $n_1 = 4$, $n_2 = 8$,
$n_3 = 12$, $n_4 = 16$ and q = 250 patches, we obtain

$$T_2 = q(n_1^2 + n_2^2 + n_3^2 + n_4^2) = 250(4^2 + 8^2 + 12^2 + 16^2) = 250(480) = 120000$$

Therefore we have $T_1 \ll T_2$.
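
The two operation counts can be reproduced with a few lines of
Python. The function names are hypothetical, and the constants
k and k_j are taken as 1, as assumed in the text.

```python
def proposed_cost(n, q):
    """T1 = q * (n^3 + n^2): q square n x n patches, SVD-based."""
    return q * (n ** 3 + n ** 2)


def modified_hmax_cost(patch_sizes, q):
    """T2 = q * sum(n_j^2): Euclidean distance over the patch sizes."""
    return q * sum(n ** 2 for n in patch_sizes)


t1 = proposed_cost(n=5, q=20)
t2 = modified_hmax_cost([4, 8, 12, 16], q=250)
print(t1, t2, t2 // t1)  # 3000 120000 40
```

Under these assumptions the proposed model needs about 40 times
fewer operations, mainly because it uses far fewer patches.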

Table IV lists the average time needed to create the feature
vector for one training sample (in Matlab 7 on a Pentium(R)
Dual-Core T4200 PC running the Windows XP operating system
with 2 GB of memory). In this case (tolerance against rotation
and noise), the proposed model is faster than the modified
HMAX model [1].

For the proposed model and the other models, the training time
is mainly spent on two steps: (1) generating a collection of
feature vectors to construct a training set, and (2) SVM
training over the feature vectors of the images. Because the
dimension of the previous model is high, it incurs a
significantly higher computational cost than the proposed
model, as shown in Table V.

V. Conclusions and future work

In this paper, we presented a new model whose framework is
similar to that of previous models inspired by the visual
cortex. The proposed model, which preserves the capabilities of
the visual cortex for scene recognition in noisy and complex
environments, employs color channels and the largest singular
values of patches in the HMAX structure. It first computes a
set of rotation-, noise- and shift-invariant, scale-free
features from a training set of images. Then, a standard
discriminative classifier is applied to the vector of features
obtained from the input images. Our approach exhibited higher
performance on the COREL image set compared with other models.
A weakness of the proposed model is the high time complexity of
computing singular values; to remedy this, we had to use a
smaller number of patches. This issue will be pursued in our
future research.

TABLE IV. Time needed by our proposed model and by the modified
HMAX model [1] to create one feature vector (in seconds).

Our proposed model   The modified HMAX model [1]
1.29172316           9.89557172


TABLE V. Time needed by our proposed model and by the modified
HMAX model [1] to create the training set and to train the SVM
(in seconds).

Step                       Our proposed model   The modified HMAX model [1]
To generate training set   64.586158            494.778586
For SVM training           0.03346              0.03590